Learning a Visual Forward Model for a Robot Camera Head
Abstract
Visual forward models predict future visual data from the previous visual sensory state and a motor command. The adaptive acquisition of visual forward models in robotic applications is plagued by the high dimensionality of visual data, which is not handled well by most machine learning and neural network algorithms. Moreover, the forward model has to learn which parts of the visual output are really predictable and which are not. In the present study, a learning algorithm is proposed which solves both problems. It relies on predicting the mapping between the visual input and output instead of directly forecasting visual data. The mapping is learnt by matching corresponding regions in the visual input and output while exploring different visual surroundings. Unpredictable regions are detected by the lack of any clear correspondence. The proposed algorithm is applied successfully to a robot camera head with additional distortion of the camera images by a retinal mapping.

1 Visuomotor Prediction

Sensorimotor control is an important research topic in many disciplines, among them cognitive science and robotics. These fields address the questions of how complex motor skills can be acquired by biological organisms or robots, and how sensory and motor processing are interrelated. So-called “internal models” help to clarify ideas of sensorimotor processing on a functional level [8, 13]. “Inverse models” or controllers generate motor commands based on the current sensory state and the desired one; “forward models” (FWM) predict future sensory states as the outcome of motor commands applied in the current sensory state. The present study focuses on the anticipation of visual data by FWMs.

The anticipation of sensory consequences in the nervous system of biological organisms is supposed to be involved in several sensorimotor processes. First, many motor actions rely on feedback control, but sensory feedback is generally too slow; here, the output of FWMs can replace sensory feedback [9]. Second, FWMs may be used in the planning process for complex motor actions [12]. Third, FWMs are part of a controller learning scheme called “distal supervised learning” [7]. Fourth, FWMs can help to separate self-induced sensory effects (which are predicted) from externally induced sensory effects (which stand out from the predicted background) [2]. Fifth, it has been suggested that perception relies on the anticipation of the consequences of motor actions which could be applied in the current situation; for this anticipation, FWMs are needed [10].

Regarding the fourth function mentioned above, a classical example is the reafference principle suggested by von Holst and Mittelstaedt [6]. It explains why (self-induced) eye movements do not evoke the impression that the world around us is moving: as long as the predicted movement of the retinal image (caused by the eye movement) coincides with the actual movement, the effect of this movement is canceled out in visual perception.

In fields like robotics or artificial life, studies using FWMs for motor control focus mainly on navigation or obstacle avoidance tasks with mobile robots. The sensory inputs to these FWMs are rather low-dimensional data from distance sensors or laser range finders (e.g., [12, 14]), optical flow fields [3], or preprocessed visual data with only a few remaining dimensions [5]. We are especially interested in the learning of FWMs in the visual domain and in its application to robot models.
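To make the reafference example concrete, the following toy sketch (our own illustration; the shift-based stand-in forward model and all names are assumptions, not the paper's method) shows how a forward model's prediction can cancel self-induced image motion so that only externally caused changes remain:

    import numpy as np

    # Toy stand-in for a forward model: a pan command shifts the image by a
    # command-dependent number of pixels. A real visual FWM would be learned,
    # as described in Section 2.
    def forward_model(image, pan_pixels):
        return np.roll(image, pan_pixels, axis=1)

    # Reafference-style cancellation: the predicted (self-induced) image is
    # subtracted from the actual one, so external events stand out.
    def external_effects(prev_image, pan_pixels, actual_next_image):
        predicted = forward_model(prev_image, pan_pixels)
        return actual_next_image - predicted

    rng = np.random.default_rng(0)
    prev_image = rng.random((4, 6))
    actual = np.roll(prev_image, 2, axis=1)   # self-induced shift ...
    actual[1, 3] += 1.0                       # ... plus one external event
    print(external_effects(prev_image, 2, actual))  # near zero except the event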
In our understanding, visual FWMs predict representations of entire visual scenes. In the nervous system, this could be the relatively unprocessed representation in the primary visual cortex or more complex representations generated in higher visual areas. Regarding robot models, the high-dimensional sensory input and output space of visual FWMs poses a tough challenge to any machine learning or neural network algorithm. Moreover, there might be unpredictable regions in the FWM output (because parts of the visual surroundings only become visible after execution of the motor command). In the present study, we suggest a learning algorithm which solves both problems in the context of robot “eye” movements. In doing so, our main goal is to demonstrate a new, efficient learning algorithm for image prediction.

2 Visual Forward Model for Camera Movements

In our robot model, we attempt to predict the visual consequences of eye movements. In the model, the eye is replaced by a camera which is mounted on a pan-tilt unit. Prediction of visual data is carried out on the level of camera images. In analogy to the sensor distribution on the human retina, a retinal mapping is carried out which decreases the resolution of the camera images from center to border. We use this mapping to make the prediction task more difficult; we do not intend to develop, implement, or test a model of the human visual pathway.

The input of the visual FWM is a “retinal image” at time step t (called “input image” in the following) and a motor command mt. The output is a prediction of the retinal image at the next time step t+1 (called “output image” in the following; see left part of Fig. 1). The question is how such an adaptive visual FWM can be implemented and trained by exploration of the environment.

Fig. 1. Left: Visual forward model (FWM). Right: Single component of a visual forward model predicting the intensity of a single pixel 〈xOut, yOut〉 of the output image.

Fig. 2. Left: Mapping model (MM). Right: Validator model (VM) (for details see text).

A straightforward approach is the use of function approximators which predict the intensity of single pixels. For every pixel 〈xOut, yOut〉 of the output image, a specific forward model FWM〈xOut,yOut〉 is acquired which forecasts the intensity of this pixel (see right part of Fig. 1). Together, the predictions of these single FWMs form the output image as in Fig. 1 (left). Unfortunately, this simple approach suffers from the high dimensionality of the input space (the retinal image at time step t is part of the input) and does not produce satisfactory learning results [4].

Hence, in this study we pursue a different approach. Instead of forecasting pixel intensities directly, our solution is based on a “back” prediction of where a pixel of the output image was located in the input image before the camera movement. The necessary mapping model (MM) is depicted in Fig. 2: as input, it receives the motor command mt and the location of a single pixel 〈xOut, yOut〉 of the output image; as output, it estimates the previous location 〈x̂In, ŷIn〉 of the corresponding pixel (or region) in the input image. The overall output image is constructed by iterating through all of its pixels and computing each pixel intensity as Î〈xOut,yOut〉 = I^In〈x̂In,ŷIn〉 (using bilinear interpolation).¹ Moreover, an additional validator model (VM) generates a signal v〈xOut,yOut〉 indicating whether it is possible at all for the MM to generate a valid output for the current input. This is necessary because, even for small camera movements, parts of the output image are not present in the input image. In this way, the overall FWM (Fig. 1, left) is implemented by the combined application of a mapping model and a validator model.

¹ In this study, pixel intensities of the retinal input and output images are three-dimensional vectors in RGB color space.
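As a concrete illustration, a minimal sketch of this construction follows (function names, signatures, and the validity handling are our assumptions, not the authors' implementation): for each output pixel, the MM is queried for its pre-image in the input image, the intensity is read back by bilinear interpolation, and the VM masks pixels without a valid pre-image.

    import numpy as np

    def bilinear_sample(image, x, y):
        # Bilinearly interpolate an H x W x 3 image at the real-valued
        # position (x, y); assumes (x, y) lies inside the image.
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1 = min(x0 + 1, image.shape[1] - 1)
        y1 = min(y0 + 1, image.shape[0] - 1)
        dx, dy = x - x0, y - y0
        return ((1 - dx) * (1 - dy) * image[y0, x0]
                + dx * (1 - dy) * image[y0, x1]
                + (1 - dx) * dy * image[y1, x0]
                + dx * dy * image[y1, x1])

    def predict_output_image(input_image, m_t, mapping_model, validator_model):
        # Build the predicted retinal image at t+1 pixel by pixel.
        h, w, _ = input_image.shape
        output = np.zeros_like(input_image)
        valid = np.zeros((h, w), dtype=bool)
        for y_out in range(h):
            for x_out in range(w):
                if not validator_model(m_t, x_out, y_out):
                    continue              # no valid pre-image in the input
                x_in, y_in = mapping_model(m_t, x_out, y_out)
                output[y_out, x_out] = bilinear_sample(input_image, x_in, y_in)
                valid[y_out, x_out] = True
        return output, valid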
The basic idea of the learning algorithm for the MM is outlined in the following for a specific mt and 〈xOut, yOut〉. During learning, the motor command is carried out in different environmental settings. Each time, both the actual input and output image are known afterwards; thus, the intensity I〈xOut,yOut〉 is known as well. It is then possible to determine which of the pixels of the input image show a similar intensity. These pixels are candidates for the original position 〈xIn, yIn〉 of the pixel 〈xOut, yOut〉 before the movement. Over many trials, the pixel in the input image which matches most often is the most likely candidate for 〈xIn, yIn〉 and is chosen as the MM output 〈x̂In, ŷIn〉. When none of the pixels matches often enough, the MM output is marked as non-valid (the output of the VM).
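The following sketch illustrates this matching scheme for one motor command and one output pixel (the data layout and both thresholds are illustrative assumptions; the paper does not specify these values):

    import numpy as np

    def learn_mapping_for_pixel(trials, sim_threshold=0.05, min_match_rate=0.5):
        # trials: list of (input_image, observed_intensity) pairs collected
        # for one fixed motor command and one output pixel; images are
        # H x W x 3 RGB arrays with values in [0, 1].
        h, w, _ = trials[0][0].shape
        match_counts = np.zeros((h, w))
        for input_image, observed in trials:
            # Vote for every input pixel whose color is similar to the
            # intensity observed at the output pixel after the movement.
            dist = np.linalg.norm(input_image - observed, axis=2)
            match_counts += dist < sim_threshold
        y_in, x_in = np.unravel_index(np.argmax(match_counts), match_counts.shape)
        if match_counts[y_in, x_in] / len(trials) < min_match_rate:
            return None        # no clear correspondence: the VM marks non-valid
        return x_in, y_in      # estimated pre-image, the MM output 〈x̂In, ŷIn〉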